This paper describes algorithms and data structures for applying a parallel computer
to information retrieval. Previous work has described an implementation based on
overlap-encoded signatures. That system was limited by 1) the necessity of keeping
the signatures in primary memory, and 2) the difficulties involved in implementing
document-term weighting. Overcoming these limitations requires adapting the
inverted index techniques used on serial machines. The most obvious adaptation,
also previously described, suffers from the fact that data must be sent between
processors at query-time. Since interprocessor communication is generally slower
than local computation, this suggests that an algorithm which does not perform such
communication might be faster. This paper presents a data structure, called a
partitioned posting file, in which the interprocessor communication takes place at
database-construction time, so that no data movement is needed at query-time.
Performance characteristics and storage overhead are established by benchmarking
against a synthetic database. Based on these figures, it appears that currently
available hardware can deliver interactive document ranking on databases
containing between 1 and 8192 Gigabytes of text.


This is a pre-print of a paper to appear in Information Processing and Management
in 1991. It should not be redistributed. Copyright 1991 Information Processing and
Management.

1.  Introduction 

Immense  quantities  of  data  are  currently  being  captured  in  electronic  form.  Most  documents  are 
composed  using  word  processing  or  desktop  publishing  systems.  Much  correspondence  takes  place 
in electronic form. Electronic bulletin boards and mailing lists provide for the fast dissemination of
information. Increasing numbers of newspapers and magazines are available in electronic form. The
scientific community is gradually moving towards electronic publishing. This growing body of
electronic text is a potentially important resource, but only if it is possible to locate desired
information  in  a  timely  manner. 

For  relatively  small  quantities  of  electronic  information,  the  methods  of  library  science  — 
cataloging,  assigning  keywords,  and  careful  organization  —  are  usable.  When  coupled  with  text 
database systems based on boolean keyword searches, they provide a method which, in the hands of a
skilled searcher, permits at least some information to be located. Questions as to the effectiveness of
these methods do remain, however, and it is unclear whether, even in the hands of an expert, these
systems provide sufficiently high-quality searches. Non-experts generally find these systems
difficult to use effectively. In any event, the librarianship required to build these electronic
repositories  is  labor  intensive,  and  the  vast  majority  of  electronic  text  is  completely  unorganized. 


There is at least a partial solution to this problem, in the form of automatic information retrieval on
full-text databases.¹ In a typical system of this class, such as Salton's SMART system (1971), the
full text will be passed through an automatic indexing system which will reduce a document to a set of
terms (which might be words, word-stems, or keywords) and weights representing the importance of
those terms to the content of the document. When a user queries the database, he will enter a set of
query terms. A second automatic method will assign weights to the query terms, and a document
scoring and ranking algorithm will be employed to find those documents most exactly matching the
user's query; this is the well known vector retrieval model, as described by Salton (1975). The user
may then browse these documents and, if necessary, refine the query either manually or via an
automatic method, for example relevance feedback as used by Rocchio (1972); the documents in the
database will then be re-ranked, and the process repeats until the user believes he has found a
sufficient number of documents. This method has been found to deliver a high-quality search and,
when combined with a graphic interface, is quite easy to use.

One  key  component  of  such  a  system  is  the  document  scoring  and  ranking  algorithm.  In  a 
conventional  inverted  file  algorithm,  as  described  by  Salton  (1982),  the  output  of  the  indexer  will  be 
sorted  by  term  identifier,  then  the  result  stored  on  disk.  When  the  user  enters  a  query,  those  index 
entries  needed  to  answer  the  query  are  loaded  into  memory  and  any  of  several  algorithms  is  employed 
to  score  and  rank  the  documents.  This  file  structure  has  the  advantage  of  minimizing  the  amount  of 
data  to  be  transferred,  and  requires  a  modest  number  of  I/O  operations.  For  small  databases,  the 
performance of the system will be limited by disk latency. As the database grows larger, the
disk-to-memory transfer time and the time needed to score and rank the documents become
increasingly important, and eventually a point will be reached where the system is too slow to be
considered  interactive.  The  I/O  throughput  can  be  improved  by  using  a  mainframe  computer  with  a 
large  number  of  disks,  but  as  long  as  a  serial  machine  is  used,  there  are  limited  prospects  for 
improving  the  performance  of  the  scoring  and  ranking  algorithms. 

Parallel  computers,  with  higher  basic  compute  rates  coupled  with  very  high  disk-to-memory 
bandwidths,  are  a  promising  alternative  to  serial  machines  for  the  solution  of  this  problem.  Systems 
based  on  parallel  adaptations  of  the  serial  inverted  file  structure  are  particularly  promising  in  this 
respect.  This  paper  will  present  an  inverted  file  structure  coupled  with  algorithms  for  document 
scoring and document ranking which, on the Connection Machine® System, deliver over 200 times
the performance attainable on a state-of-the-art serial machine (Sun 4/330). Depending on the
hardware  configuration  and  the  I/O  strategy  employed,  several  different  system  architectures  may  be 
implemented.  One  implementation,  storing  the  database  entirely  in  primary  memory,  can  handle 
databases  with  up  to  24  Gigabytes  of  data,  delivering  response  times  under  100  milliseconds.  A 
second  implementation,  using  a  disk-array  optimized  for  large  numbers  of  I/O's  per  second,  is 
suitable  for  databases  with  between  3  and  128  Gigabytes  of  data,  and  delivers  responses  in  between 
0.6  and  2  seconds.  A  third  implementation,  using  a  disk  array  optimized  for  high  transfer  rates,  is 
suitable  for  databases  with  between  128  and  8192  Gigabytes  of  data,  and  delivers  responses  in 
between 2 and 15 seconds.


1. See, for example, van Rijsbergen (1979) and Salton (1989).
® Connection Machine is a registered trademark of Thinking Machines Corporation.

1.1 Previous Work

Initial parallel implementations of information retrieval by Stanfill and Kahle (1986), and by Pogue
and Willett (1987), were based on overlap-encoded signatures. Overlap encoding is subject to some
very specific constraints:

- According to Stone (1987), if the signature file does not fit in primary memory the I/O load
will prevent interactive access.

- Only binary document weights can be supported; according to Salton and Buckley (1988)
and Croft (1988) this may reduce the quality of the search, relative to systems which support
full vector weighting.

Within the above constraints, Stanfill (1988, 1990a) argues that signature files deliver excellent
response times. However, these constraints somewhat limit the usefulness of the method.

Stone  (1987)  has  suggested  that  parallel  inverted  files  might  not  be  subject  to  the  above  limitations. 
Stanfill,  Thau,  and  Waltz  (1989)  have  described  one  possible  parallel  inverted  file  structure  for  use  on 
the  Connection  Machine.  Stanfill  (1990b)  has  described  a  second  structure  called  partitioned  posting 
files, which alters the data layout so as to reduce interprocessor communication. This paper expands
on  the  previously  published  description  of  the  partitioned  posting  file  structure. 

1.2  Paper  Organization 

This  paper  will  describe  the  application  of  parallel  partitioned  files  to  databases  between  1  and  8192 
Gigabytes,  on  massively  parallel  computers  with  between  4096  and  65,536  processors.  Section  2 
presents  some  background  in  information  retrieval  and  parallel  computing.  Section  3  describes  serial 
as  well  as  parallel  inverted  file  schemes.  Section  4  describes  the  partitioned  posting  file  structure  and 
presents  some  results  on  the  efficiency  and  storage  overhead  involved  in  this  structure.  Section  5 
describes  how  partitioned  posting  files  may  be  used  to  score  documents.  Section  6  describes  how 
documents, once scored, may be ranked. Section 7 describes two I/O strategies for use with
partitioned posting files. Section 8 describes three different system architectures which are suitable
for differing sizes of databases. Section 9 summarizes the results and discusses directions for further
research.

1.3 A Cautionary Note

The  performance  results  presented  below  are  based  on  benchmarks  against  a  synthetic  database. 
Caution  must  be  used  in  interpreting  the  results  of  such  studies.  First,  no  real  database  is  going  to 
exactly match the synthetic one and, as a result, the performance of the system on real data is likely to
differ from the results presented here. Second, benchmark figures are not the same as
application-level performance. Benchmarks do not account for system overhead, queuing delays,
and the stages of processing which come before or after the actual retrieval step. They do not include
the sort of exception-handling and error-detection needed to build a robust application. They do not
account for the storage of the actual full text. However, the synthetic database does model a realistic
workload, and the document scoring/ranking step does constitute the bulk of the work required to
search a database. In any event, the benchmark figures do reflect the relative performance of various
methods  of  solving  this  problem. 


2.  Background 

2.1  Information  Retrieval 

A  document  is  represented  as  an  unordered  set  of  weighted  terms.  The  weights  indicate  the 
importance  of  each  term  within  the  document.  The  terms  may  correspond  to  words,  or  to 
word-stems, or to phrases, or to high-level concepts. The extraction of terms and assignment of
weights is generally done automatically, as suggested by Salton (1970). Queries also consist of
weighted terms, and are generally produced by a combination of automatic and manual methods;
such methods are discussed in detail by Salton (1987).

The operation of a retrieval system can best be described in terms of the vector model. A database
consists of a set of documents and a vocabulary of $n$ terms $T_i$. A document $D$ is represented as a vector
of length $n$ such that $D_i > 0$ only if $T_i$ is present in $D$. A query $Q$ has the same representation.
Retrieval is based on some measurement of document-query similarity. This paper will assume the
cosine similarity measure (Equation 1).

$$\mathrm{similarity}(D, Q) = \frac{\sum_{i=1}^{n} D_i Q_i}{\sqrt{\sum_{i=1}^{n} D_i^2}\,\sqrt{\sum_{i=1}^{n} Q_i^2}} \tag{1}$$

The  basic  computation  of  information  retrieval  is  finding  the  document  D  which  maximizes 
similarity(D,  Q). 
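
The cosine measure can be illustrated with a short Python sketch. It is not taken from the paper: the sparse dictionary representation, the function names, and the toy vectors are all assumptions chosen for illustration, and returning 0 for an empty vector is a convention of this sketch only.

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between two sparse term-weight vectors.

    Each vector is a dict mapping term identifier -> weight; a term that
    is absent has weight 0, as in the vector model described above.
    """
    dot = sum(w * q[t] for t, w in d.items() if t in q)
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0  # convention of this sketch for empty vectors
    return dot / (norm_d * norm_q)

def best_document(docs, q):
    """Return the identifier of the document that maximizes similarity(D, Q)."""
    return max(docs, key=lambda doc_id: cosine_similarity(docs[doc_id], q))

# Toy data (hypothetical): two documents over term identifiers 1..4.
docs = {
    0: {1: 1.0, 2: 1.0},
    1: {2: 1.0, 3: 1.0, 4: 1.0},
}
query = {1: 3.0, 2: 2.0}
winner = best_document(docs, query)
```

Scanning every document in this way is the brute-force form of the computation; the inverted file structures discussed later exist to avoid touching documents that share no term with the query.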

Retrieval  methods  based  on  this  vector  model  differ  significantly  from  those  provided  by  most 
commercially  available  text  databases  (e.g.  Westlaw,  Lexis,  Dialog).  The  most  obvious  difference  is 
that the commercial systems employ boolean connectives such as AND/OR rather than numerical
weights. A less obvious distinction has to do with indexing. The vector model, as formulated above,
assumes that a document is automatically indexed so as to reduce it to a set of terms which reflect its
content, with weights reflecting the importance of those terms. If "terms" are equated with "words"
or "word stems," difficulties will inevitably arise. For example, if a document containing the
text-string "New Mexico" is indexed as containing only the terms "New" and "Mexico," then that
document may be effectively unretrievable: for every document containing "New Mexico" there may
be a hundred which coincidentally contain "New" and "Mexico," but refer to the country rather than
the state. The commercial systems mentioned above overcome this problem by recording the
location of every word in the database and providing a proximity operator which allows the user to
search for "New" immediately followed by "Mexico." These proximity operators are extremely
powerful, and allow for a number of retrieval strategies which depend on the context in which a term
occurs. 

Similar  difficulties  arise  from  the  morphological  structure  of  language.  English  morphology  may,  for 
the most part, be handled at indexing-time by stripping suffixes from words to arrive at word-stems,
or at query-time by allowing the user to incorporate wild-cards at the end of search terms (tail
truncation). Such simple strategies may not suffice in other languages. For example, German uses
long compound nouns; searching a German database might thus require both left- and right-
wildcards in search terms, or extensive morphological analysis to break the compounds into their
component nouns. In Japanese, the boundaries between words are not indicated by spaces, resulting
in  considerable  ambiguity.  Searching  a  Japanese  database  might  thus  require  either  arbitrary 
substring  search,  or  some  form  of  linguistic  analysis  to  indicate  which  words  are  actually  present. 


Effective use of the algorithms as described below thus depends on proper automatic indexing, which
must include as a minimum some mechanism for detecting multi-word proper nouns such as New
Mexico. In cases where such indexing is not possible, proximity-based methods might be
mandatory. The algorithms and data structures described below can probably be modified to
incorporate proximity operators, but a discussion of such issues is beyond the scope of this paper.
Ultimately, however, improved automatic indexing is probably more desirable than simply making
complex proximity-based search methods run faster.

2.2  A  Synthetic  Database 

Evaluation  of  scoring  and  ranking  algorithms  will  be  done  using  a  synthetic  database  and  query  load 
described by Stanfill (1989). This has the advantage that 1) the database may be easily replicated by
other researchers; 2) the parameters of the database (such as size) may be altered to explore the
behavior of retrieval algorithms; 3) large quantities of disk need not be tied up; and 4) generation of
the database can be selective: given a 10-term query, only that portion of the posting file needed to
evaluate  that  query  needs  to  be  synthesized. 

The lexicon for the database consists of $n$ terms, $T_1$ through $T_n$. The frequency $f(T_i)$ of $T_i$
(occurrences per megabyte) is given by Equation 2, in accordance with Zipf's law. Terms 1 through $s$
are stop words, and are not put into the posting file. It is assumed that the terms found in queries have
the same frequency distribution as the terms in the database, omitting stop-words. The probability
distribution of a random query term $Q_j$ is given by Equation 3. Equation 3 is, of course, subject to the
constraint that $Q_j$ be a probability distribution (Equation 4).

$$f(T_i) = \frac{c_1}{i} \tag{2}$$

$$\Pr(Q_j = T_i) = \begin{cases} 0 & i \le s \\ \dfrac{c_2}{i} & \text{otherwise} \end{cases} \tag{3}$$

$$\sum_{i=s+1}^{n} \frac{c_2}{i} = 1 \tag{4}$$

The amount of work done by an inverted file algorithm is governed by how many times a query-term
occurs in the database. For example, a rare term might occur only once, and hence involve a relatively
small amount of work; a common term might occur a million times, and hence involve a larger
amount of work. The distribution of query-term frequencies is thus of great importance. This
distribution is $f(Q_j)$, and the expected frequency of a randomly selected query-term is $Z = E(f(Q_j))$
(Equation 5).

$$Z = \sum_{i=s+1}^{n} \Pr(Q_j = T_i)\, f(T_i) = c_1 c_2 \sum_{i=s+1}^{n} \frac{1}{i^2} \tag{5}$$

The amount of disk required to store an inverted file is determined by the total number of non-stop
words in the database which depends, in turn, on the total term-frequency $R_T$ (Equation 6). The total
number of terms in a database of $|D|$ megabytes will then be $R_T |D|$.

$$R_T = \sum_{i=s+1}^{n} f(T_i) = \frac{c_1}{c_2} \tag{6}$$


In practice, it is not possible to directly measure $c_1$ or $c_2$. One may, however, measure $Z$ and $R_T$. For a
database of newswire articles a value of $Z = 3$ is reasonable (i.e. a randomly selected query-term
occurs three times per megabyte). Similarly, a value of $R_T = 58{,}000$ (i.e. one term per 17.3 bytes) is
reasonable for newswire data. At this point one may freely choose one of the four parameters (in this
case, $n = 200{,}000$), and use the constraints in Equations 4, 5, and 6 to determine values for $s$, $c_1$, and $c_2$.
These values are given in Table 1.

Parameter   Value
n           200,000
s           550
c_1         9778
c_2         0.1696

Table 1: Database Parameters for Newspaper Articles
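
As a sanity check on Table 1, the constraints can be evaluated numerically. The sketch below is an illustration rather than the authors' procedure. It assumes that Equation 4 fixes $c_2$ as the reciprocal of $\sum_{i=s+1}^{n} 1/i$, that Equation 5 gives $Z = c_1 c_2 \sum_{i=s+1}^{n} 1/i^2$, and that Equation 6 gives $R_T = c_1/c_2$; the function name is hypothetical.

```python
def check_parameters(n, s, c1):
    """Recompute c2, Z and R_T from n, s and c1 under the Zipf model.

    Assumes f(T_i) = c1 / i and Pr(Q_j = T_i) = c2 / i for s < i <= n.
    """
    sum_inv = sum(1.0 / i for i in range(s + 1, n + 1))           # sum of 1/i
    sum_inv_sq = sum(1.0 / (i * i) for i in range(s + 1, n + 1))  # sum of 1/i^2
    c2 = 1.0 / sum_inv             # Equation 4: probabilities sum to 1
    z = c1 * c2 * sum_inv_sq       # Equation 5: expected query-term frequency
    r_t = c1 / c2                  # Equation 6: non-stop terms per megabyte
    return c2, z, r_t

# Values of n, s and c_1 taken from Table 1.
c2, z, r_t = check_parameters(n=200_000, s=550, c1=9778)
```

With these inputs the computed $c_2$, $Z$, and $R_T$ agree with the quoted 0.1696, 3, and 58,000 to the precision given in the text.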

Finally, the performance of the ranking algorithm depends on the number of documents in the
database and the number of documents to be actually ranked and returned to the user. This paper will
assume an average document-length of 5000 bytes, and a rank-count of 20. The query length
(number of terms per query) is also important in predicting system performance. This system will
assume two query-types might be desired: relatively small queries having 10 terms, and somewhat
larger queries with 30 terms. The former might be generated manually; the latter might be created by a
semi-automatic  method  such  as  thesaurus-based  expansion  or  relevance  feedback. 

Six  parameters — database  size,  average  query-term  frequency,  query  length,  rate  of  occurrence  of 
non-stop-words,  average  document  size  and  number  of  documents  returned  —  will  be  different 
from one database to another, but have essentially independent effects. The database size, average
query-term frequency, and the query length determine the absolute amount of work to be done in the
scoring phase. The shape of the query-term frequency distribution may affect the efficiency of the
algorithm. The database size and the frequency of non-stop-words determine the amount of storage
required to store the inverted indexes and, again, the shape of the distribution affects the efficiency of
the file layout. The database size, the average document length, and number of documents to be
ranked affect the effort required to rank the documents once scored. The shape of the
document-length distribution is probably not very important to these algorithms.

A database is synthesized by specifying the number of documents and size of the full text in
Megabytes. If there are $N_{docs}$ documents and $b$ megabytes of data in the database, then (on the
average) $b \cdot f(T_i) = b\,c_1/i$ occurrences of term $T_i$ are generated and randomly assigned document
identifiers between 0 and $N_{docs} - 1$. Queries are generated by a similar mechanism.
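
The selective generation described above might look like the following Python sketch. It is an assumption-laden illustration, not the generator used for the benchmarks: the function names are hypothetical, the expected count is taken to be the database size in megabytes times $f(T_i) = c_1/i$, fractional counts are handled by randomized rounding, and query terms are drawn with probability proportional to $1/i$ over the non-stop terms.

```python
import random

def synthesize_postings(term_rank, c1, megabytes, n_docs, rng):
    """Generate the document identifiers for one term of the synthetic database.

    Only the posting list of the requested term is produced, so a query can
    be evaluated without synthesizing the whole database.
    """
    expected = megabytes * c1 / term_rank   # size in megabytes times f(T_i)
    count = int(expected)
    # Randomized rounding keeps the mean equal to `expected` (choice of this sketch).
    if rng.random() < expected - count:
        count += 1
    return [rng.randrange(n_docs) for _ in range(count)]

def synthesize_query(length, n, s, rng):
    """Draw query terms with probability proportional to 1/i for s < i <= n."""
    ranks = range(s + 1, n + 1)
    weights = [1.0 / i for i in ranks]      # random.choices normalizes these
    return rng.choices(ranks, weights=weights, k=length)

rng = random.Random(0)
# A 100-megabyte database of 5000-byte documents holds about 20,000 documents.
postings = synthesize_postings(term_rank=1000, c1=9778, megabytes=100.0,
                               n_docs=20_000, rng=rng)
query_terms = synthesize_query(length=10, n=200_000, s=550, rng=rng)
```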

2.3 Data Parallel Computing

This  paper  assumes  the  data  parallel  programming  model  as  described  by  Hillis  and  Steele  (1986). 
The  Connection  Machine  System,  as  described  by  Hillis  (1985)  and  realized  in  the  Connection 
Machine  Model  CM-2  by  Thinking  Machines  Corporation  (1987),  will  be  used  as  the  hardware 
model.  The  model  includes  a  serial  host  computer  and  a  large  number  of  processing  elements  (PE's  or 
processors).  Data  structures  are  distributed  uniformly  across  the  processing  elements,  and  may  be 
thought  of  as  vectors.  There  are  three  aspects  of  computation:  serial  computation,  local  parallel 
computation,  and  non-local  parallel  computation.  Serial  computation  consists  of  arbitrary 
operations  on  scalars.  Scalars  may  be  freely  promoted  to  vectors  by  broadcasting  their  value  to  the 
processing  elements.  Local  parallel  computation  consists  of  arbitrary  element-wise  computations 
applied to the contents of the PE's memories. In this mode, the processing elements may be thought of
as a set of independent machines operating on scalar quantities stored in their local memories.
Processing elements may temporarily deactivate themselves. Non-local parallel computation
consists of operations that move data either from one processor to another or from the processing
elements to the host. Non-local computations may be as simple as permuting the data or as complex
as sorting it. Systems having between 4096 and 65,536 processing elements will be considered in this
paper. 
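
The three aspects of computation can be mimicked on a serial machine with ordinary lists, one element per processing element. The sketch below is a toy simulation for orientation only. The function names are hypothetical and are not the Connection Machine programming interface, and nothing here runs in parallel.

```python
from functools import reduce

N_PE = 8  # toy machine; the paper considers 4096 to 65,536 processing elements

def broadcast(scalar, n_pe=N_PE):
    """Promote a host scalar to a vector holding one copy per PE."""
    return [scalar] * n_pe

def elementwise(fn, a, b, active=None):
    """Local parallel computation: each active PE combines its own elements.

    A deactivated PE leaves its element of `a` unchanged.
    """
    if active is None:
        active = [True] * len(a)
    return [fn(x, y) if on else x for x, y, on in zip(a, b, active)]

def permute(vector, destination):
    """Non-local computation: PE i sends its element to PE destination[i]."""
    out = [None] * len(vector)
    for i, dest in enumerate(destination):
        out[dest] = vector[i]
    return out

def reduce_to_host(fn, vector):
    """Non-local computation: combine all PE values into one host scalar."""
    return reduce(fn, vector)

values = list(range(N_PE))                                   # PE i holds i
scaled = elementwise(lambda x, y: x * y, values, broadcast(10))
odd_only = elementwise(lambda x, y: x + y, scaled, broadcast(1),
                       active=[i % 2 == 1 for i in range(N_PE)])
rotated = permute(odd_only, [(i + 1) % N_PE for i in range(N_PE)])
total = reduce_to_host(lambda x, y: x + y, rotated)
```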

Each processor has its own local memory. Connection Machines may be configured with up to 128
Kilobytes per processor; this translates into 1/2 Gigabyte for a 4K processor machine or 8 Gigabytes
for a 64K machine. Processors have indirect addressing hardware which allows them to access their
local memory via an index register.² In addition, the indirect addressing hardware permits a group of
32  processors  (called  a  node)  to  share  each  other's  memory.  For  analytic  purposes,  one  may 
sometimes  treat  a  node  as  a  single  processor,  so  that  a  64K  processor  machine  might  be  thought  of  as  a 
machine  with  2048  nodes,  each  of  which  is  itself  a  32-processor  shared-memory  parallel  SIMD 
machine.  This  point-of-view  will  frequently  be  taken  in  the  discussion  below. 

I/O  is  provided  by  a  high-throughput  disk  system.  The  Connection  Machine  uses  a  parallel  disk  array 
called the Data Vault™ mass storage system. Each such unit provides up to 40 Gigabytes of storage
with a transfer rate of 25 Megabytes per second. Data Vault files are vector-structured: each location
in the file stores one byte/word from each processing element in the machine. The Data Vault may be
used in two modes: striped mode in which the Data Vault simulates a single disk with a very high
transfer rate by striping data across all drives; and independent access mode in which all 32 drives
may be accessed independently to simulate a disk-farm with a slightly lower transfer rate but a greater
number of transfers-per-second. Up to 8 Data Vaults may simultaneously be active, providing
transfer  rates  of  up  to  200  Megabytes  per  second. 

3.  Adapting  Inverted  Files  to  Parallel  Computing 

This  section  will  present  a  simple  implementation  of  inverted  files  on  serial  machines,  then  explain  its 
adaptation  to  parallel  computing. 

3.1  Serial  Inverted  Indexes 

Preparatory to considering parallel inverted indexes, it is worth considering the implementation of
inverted indexes on serial machines, and the performance of such an implementation. A reasonably
simple implementation, similar to that described by Doszkocs (1982), will be presented and its
performance analyzed.

2. This capability was missing from some early SIMD parallel machines, such as the Connection Machine model
CM-1, Active Memory Technology's DAP (q.v. Flanders, 1977) and the Goodyear MPP (q.v. Batcher, 1980). In
these machines, all processors were forced to access the same memory address on a given instruction cycle.

This  paper  presumes  that  the  database  has  been  indexed,  and  each  document  reduced  to  a  set  of 
structures  of  the  form  <document-id,  term-id,  weight>  where  document-id  is  an  integer  uniquely 
identifying  a  document;  term-id  is  an  integer  uniquely  identifying  a  term;  and  weight  is  a  number 
representing  the  importance  of  the  term.  This  structure  is  referred  to  as  a  posting. 

Consider, for example, the set of documents shown in Figure 1. It consists of four documents, each
containing between three and five words. The first step in indexing it would be to assign a term
identifier to each of the eleven different words in the document set,³ as is done in Figure 2. Similarly,
each document may be assigned an identifier, starting at 0. Weights are assigned by some suitable
indexing procedure. In this case, each term will receive a weight of 1 which, when the
document-vectors are normalized, will become $1/\lVert D \rVert$, the reciprocal of the document vector's
Euclidean norm. The result is the set of raw postings shown in Figure 3.

Figure 1: Four sample documents used in examples throughout this paper.

Figure 2: Assignment of terms to term identifiers

Figure 3: Raw Posting File for the documents in Figure 1

The first step in producing an inverted file is sorting these raw postings by term identifier (Figure 4).
The positions of the first and last occurrences of each term are next recorded in a table called the data
map, which will also include the mapping from text strings to term identifiers (Figure 5). The term
identifiers in the postings are now redundant and are dropped. The result is an inverted file, which is
stored on disk (Figure 6).


3. Many of these words, such as am and I, are stop words and would normally be dropped. They are retained here for
the sake of the example.

Figure 4: Sorted Postings

ID   Word       Start   End
0    am         0       1
1    be         2       2
2    document   3       5
3    first      6       6
4    fourth     7       7
5    I          8       9
6    is         10      10
7    the        11      11
8    this       12      13
9    three      14      14
10   two        15      15

Figure 5: Data Map

Figure 6: Inverted File
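
The construction steps (sort by term identifier, record first and last positions in the data map, drop the term identifiers) can be sketched in Python as follows. The four sample documents of Figure 1 did not survive in this copy, so the strings below are hypothetical stand-ins, chosen only because they reproduce the positions listed in the data map of Figure 5. The function name and the case-insensitive alphabetical assignment of term identifiers are likewise assumptions of this sketch.

```python
import math

# Hypothetical stand-ins for the lost Figure 1 documents.
DOCUMENTS = [
    "this is the first document",
    "I am document two",
    "this be document three",
    "I am fourth",
]

def build_inverted_file(documents):
    """Return (data_map, inverted_file) for a list of document strings."""
    vocabulary = sorted({w for text in documents for w in text.split()},
                        key=str.lower)
    term_id = {word: i for i, word in enumerate(vocabulary)}

    # Raw postings <document-id, term-id, weight>; unit weights are
    # normalized by the Euclidean norm of the document vector.
    raw = []
    for doc_id, text in enumerate(documents):
        words = set(text.split())
        weight = 1.0 / math.sqrt(len(words))
        raw.extend((doc_id, term_id[w], weight) for w in words)

    raw.sort(key=lambda posting: (posting[1], posting[0]))  # sort by term id

    # Data map: word -> [term id, first position, last position].
    data_map = {}
    for position, (_, tid, _) in enumerate(raw):
        word = vocabulary[tid]
        if word not in data_map:
            data_map[word] = [tid, position, position]
        else:
            data_map[word][2] = position

    # Term identifiers are now redundant and are dropped.
    inverted_file = [(doc_id, weight) for doc_id, _, weight in raw]
    return data_map, inverted_file

data_map, inverted_file = build_inverted_file(DOCUMENTS)
```

Under these stand-in documents, the entry for document spans positions 3 through 5 and the entry for this spans positions 12 through 13, matching the data map.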

A  query  consists  of  a  set  of  terms  and  weights,  for  example  3  *  document  +  2  *  this.  Evaluating  a 
query  takes  place  in  three  stages:  initialization,  scoring,  and  ranking.  Initialization  consists  of 
allocating  one  score  register  (called  a  mailbox)  for  each  document  and  zeroing  it  (Figure  7).  Next,  the 
data-map  entry  for  each  term  is  accessed,  so  that  the  positions  of  the  first  and  last  postings  for  each 
query-term are known. In the query shown above, the term document occupies positions 3 through 5,
and the term this occupies positions 12-13. The appropriate portions of the posting file are then
brought into primary memory (Figure 8). The final step is to iterate through the query-terms, and
through  the  postings  for  each  query-term.  The  query-term  weights  are  multiplied  by  the 
document-term  weights  (found  in  the  postings),  and  used  to  increment  the  score  in  the  mailbox 
indexed  by  the  posting's  document  identifier.  The  results  are  shown  in  Figure  9. 

Figure 7: Mailboxes initialized to 0

Figure 8: Postings for each query term are loaded into memory

Figure 9: Results of Query on Serial File
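
The initialization and scoring stages can be sketched as below. The posting file is hard-coded from hypothetical documents (the originals in Figure 1 are not available here), arranged so that document occupies positions 3 through 5 and this occupies positions 12 through 13 as stated in the text. The weights and the resulting scores are therefore illustrative and need not match Figure 9.

```python
import math

# Normalized unit weights for documents with 5, 4 and 3 distinct terms.
W5, W4, W3 = 1 / math.sqrt(5), 1 / math.sqrt(4), 1 / math.sqrt(3)

# Inverted file as (document-id, weight) pairs; contents are hypothetical.
INVERTED_FILE = [
    (1, W4), (3, W3),            # am        positions 0-1
    (2, W4),                     # be        position  2
    (0, W5), (1, W4), (2, W4),   # document  positions 3-5
    (0, W5),                     # first     position  6
    (3, W3),                     # fourth    position  7
    (1, W4), (3, W3),            # I         positions 8-9
    (0, W5),                     # is        position  10
    (0, W5),                     # the       position  11
    (0, W5), (2, W4),            # this      positions 12-13
    (2, W4),                     # three     position  14
    (1, W4),                     # two       position  15
]
DATA_MAP = {"document": (3, 5), "this": (12, 13)}  # entries this query needs

def score(query, data_map, inverted_file, n_docs):
    """Serial scoring: one mailbox per document, incremented per posting."""
    mailboxes = [0.0] * n_docs                    # initialization
    for term, query_weight in query:
        start, end = data_map[term]
        postings = inverted_file[start:end + 1]   # stands in for the disk read
        for doc_id, doc_weight in postings:
            mailboxes[doc_id] += query_weight * doc_weight
    return mailboxes

# The query 3 * document + 2 * this from the text.
mailboxes = score([("document", 3.0), ("this", 2.0)],
                  DATA_MAP, INVERTED_FILE, n_docs=4)
```

Note that the inner loop performs one multiply-add and one indexed store per posting, which is why the posting count dominates the scoring cost.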

Finally, the mailboxes are scanned to locate the $N_{ret}$ highest-ranking documents. This paper assumes
$N_{ret} = 20$. A good serial algorithm may be described as follows:

1) The scores are divided into $n$ groups of $N_{docs}/n$ documents.

2) The largest element of each group is located, placed in an array, and the original is zeroed
out.

3) The array is converted to a heap.

4) The first element of the heap is extracted. It is the highest-scoring document in the database.

5) The group from which it came is determined. That group is re-scanned, and a new group
maximum is found.

6) The maximal element is put in the heap, and the original is zeroed.

7) Steps 4-6 are repeated until $N_{ret}$ documents have been found.

This  algorithm  requires  the  construction  of  an  /i-element  heap  (step  3)  and  7V,«   insert/delete 
operations  (step  6).  The  time  for  the  heap  operations  is  thus  (n  +  Nr^)  log  n  ,  and  is  clearly  minimized 

by  making  n  as  small  as  possible.  The  algorithm  also  requires  n  +  N^^  scans  of  groups  of  ~^ 

mailboxes  (steps  2  and  5).  The  total  time  for  the  scanning  is  thus  («  +  N^)  — ^^  =  N^^,  +    '         ,  and 

is  clearly  minimized  by  making  n  as  large  as  possible.    An  optimal  value  for  n  may  be  found  either 
analytically  or  empirically. 
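
The seven steps above can be sketched as follows. This is a hedged illustration: the name `top_documents` is hypothetical, Python's `heapq` (a min-heap) stands in for the heap by storing negated scores, and the sketch assumes scores are non-negative, with zero meaning "no match", so that zeroed entries are never returned twice.

```python
import heapq


def top_documents(scores, n_groups, n_ret):
    """Return up to n_ret (score, doc_id) pairs in descending score order."""
    scores = list(scores)                       # private copy; entries get zeroed
    size = -(-len(scores) // n_groups)          # documents per group (rounded up)

    def take_group_max(group):
        lo, hi = group * size, min((group + 1) * size, len(scores))
        if lo >= hi:
            return None
        best = max(range(lo, hi), key=scores.__getitem__)
        if scores[best] <= 0.0:                 # assumption: zero means "no match"
            return None
        entry = (-scores[best], best)           # negated because heapq is a min-heap
        scores[best] = 0.0                      # zero out the original (steps 2 and 6)
        return entry

    heap = []
    for group in range(n_groups):               # step 2: one maximum per group
        entry = take_group_max(group)
        if entry is not None:
            heap.append(entry)
    heapq.heapify(heap)                         # step 3

    ranked = []
    while heap and len(ranked) < n_ret:         # step 7
        neg_score, doc_id = heapq.heappop(heap)  # step 4
        ranked.append((-neg_score, doc_id))
        entry = take_group_max(doc_id // size)  # steps 5 and 6
        if entry is not None:
            heapq.heappush(heap, entry)
    return ranked
```

Small `n_groups` keeps the heap cheap but makes each re-scan long; large `n_groups` does the opposite, which is the trade-off the cost expressions describe.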

3.2  Performance  of  the  Serial  Algorithm 

It is useful to measure the performance of the serial algorithm as a baseline in judging the performance
of the parallel algorithms. The scoring algorithm was benchmarked by creating an array of 100,000
postings and an array of 100,000 mailboxes; the scoring algorithm was then applied to all 100,000
postings. The time measured on a fast serial processor (a Sun 4/330) was 0.78 seconds, giving a
processing rate of .13 million postings per second (see footnote 4). The ranking algorithm was also
measured on a Sun 4/330. With n = 1000 (which was empirically determined to be optimal in this case),
finding the highest-ranking 20 documents from a collection of 100,000 mailboxes required 0.11 seconds,
giving a ranking rate of .91 million mailboxes per second. The CPU performance may then be extrapolated
to various database sizes by computing, for each database size, the total number of postings and total
number of mailboxes to be scanned for each query. Table 2 presents performance estimates for
databases between 1 and 128 Gigabytes, for queries of 10 and 30 terms. It must be noted that the
figures below assume a main-memory database; no allowance is being made for the I/O time required
to load postings into memory. Thus, actual application-level times are likely to be somewhat greater
than predicted by the model below. These figures do, however, provide a meaningful lower bound on
the time to score and rank documents on at least one high-performance serial machine. Other serial
machines might, of course, be either faster or slower, and in particular it is quite likely that a more
advanced microprocessor or a mainframe might achieve significantly higher performance.

4. Each posting was packed into a 32-bit quantity (8 bits weight, 24 bits document identifier); unpacking this quantity
added to the run time. Higher performance might be obtained by using larger, unpacked postings, but this is generally
a poor tradeoff, since memory capacity (for main-memory databases) and I/O bandwidth (for disk-resident
databases) are generally more constraining than pure CPU performance in a well-balanced system.
File size, Gigabytes           1      2      4      8     16
Postings/Term (1000's)         3      6     12     24     48
Score 1 Term (msec)           23     46     92    184    368
Score 10 Terms (msec)        230    460    920   1840   3680
Score 30 Terms (msec)        690   1380   2760   5520    ***
Documents (millions)          .2     .4     .8    1.6    3.2
Rank (msec)                  219    438    876   1752   3504
Total, 10 terms (msec)       449    898   1796   3592   7184
Total, 30 terms (msec)       909   1818   3636   7272    ***

Table 2: Performance of the serial algorithm on a Sun 4/330
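
The extrapolation behind Table 2 is simple arithmetic on the two measured rates. The sketch below is illustrative only; the function and constant names are assumptions, and because it uses the rounded rates quoted in the text, its results agree with the table only to within rounding.

```python
SCORE_RATE = 0.13e6   # postings per second, as measured on the Sun 4/330
RANK_RATE = 0.91e6    # mailboxes per second, as measured on the Sun 4/330


def serial_estimate_msec(postings_per_term, documents, query_terms):
    """Return (scoring, ranking, total) time estimates in milliseconds."""
    scoring = 1000.0 * query_terms * postings_per_term / SCORE_RATE
    ranking = 1000.0 * documents / RANK_RATE
    return scoring, ranking, scoring + ranking


# 1 Gigabyte column: 3,000 postings per term and 0.2 million documents.
scoring_1gb, ranking_1gb, total_1gb = serial_estimate_msec(3_000, 200_000, 10)
```

Both terms grow linearly with database size, so doubling the database doubles every entry in the table.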

Some improvements in the performance of the ranking portion of the serial algorithm may be
obtained by pruning strategies. Such strategies are based on the fact that documents which match
only high-frequency (hence low-selectivity) terms in a query are unlikely to be relevant to the user's
request. According to Harmon (1990), pruning may reduce the number of documents to be ranked by
60% without significantly affecting the results of the search. More elaborate pruning schemes have
been proposed by Buckley and Lewit (1985). The parallel algorithms described below might benefit
from such pruning, but such modifications will not be considered in this paper.

3.3 Parallel Inverted Indexes

Stanfill, Waltz, and Thau (1989) describe a parallel algorithm that takes the posting file, in the form
described above, and places it in a Connection Machine so that adjacent entries in the posting file are
in adjacent processors. For example, mapping the posting file in Figure 6 to a 4-processor machine
results in 4 postings being placed in each processor (Figure 10). Mailboxes are assigned so that the
mailboxes for consecutive documents map to consecutive processors (Figure 11).
Figure 10: Assignment of Postings to Processors
Figure 11: Assignment of Mailboxes to Processors

To start processing the query, the location of the postings for each query term must be determined.
The data map (Figure 5) is accessed as before to determine the file-offset for the first and last postings
of each term. On a machine with $p$ processors, location $n$ of the serial posting file maps to memory
location $\lfloor n/p \rfloor$ in processor $n \bmod p$. In the above example, the postings for document occupy
positions 3 through 5 of the serial posting file. These locations map to row 0 processor 3 through row
1 processor 1 of the parallel file. These location-ranges are then broken into groups that do not span
row boundaries (Figure 12).

Word        Weight    Row    Start    End

Document    3         0      3        3
            3         1      0        1
This        2         3      0        1

Figure 12: Mapping query terms to non-spanning groups of postings
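
The mapping from a range of serial positions to non-spanning groups can be sketched as below. The function name `split_into_rows` is hypothetical; the two calls reproduce the rows of Figure 12 for a 4-processor machine.

```python
def split_into_rows(first, last, p):
    """Map inclusive serial positions first..last to (row, start, end) groups."""
    groups = []
    first_row, last_row = first // p, last // p
    for row in range(first_row, last_row + 1):
        start = first % p if row == first_row else 0      # partial first row
        end = last % p if row == last_row else p - 1      # partial last row
        groups.append((row, start, end))
    return groups


document_groups = split_into_rows(3, 5, 4)     # postings for "document"
this_groups = split_into_rows(12, 13, 4)       # postings for "this"
```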

The  I/O  system  is  then  called  on  to  move  these  postings  into  memory.  As  noted  above  in  the  section 
on  parallel  computing,  the  I/O  system  operates  by  writing  vectors  (rows  of  data)  to  disk,  and  reading 
vectors  into  memory.  The  algorithm  then  iterates  through  the  rows  of  this  table.  For  each  <weight, 
row,  start,  end>  quadruple,  those  processors  having  processor  ID's  between  start  and  end  will  access 
the  posting  at  memory  location  row.  The  query-weight  will  be  multiplied  by  the  posting-weight. 
The  result  will  be  used  to  increment  the  contents  of  the  appropriate  mailbox.  This  last  step  involves 
sending  data  between  processors. 

In this example, the postings for document occupy processor 3 of row 0, and processors 0 and 1 of row
1. Execution starts by considering row 0 (Figure 13). First, the algorithm notes that only processor 3
contains row-0 postings for document; all other processors are deactivated. Second, this processor
accesses the DOC ID's and document weights of row 0. Third, the document term-weight (.20) is
multiplied by the query term-weight (3), yielding the posting's contribution to its document's score
(.60). Finally, the processor sends a message to processor 0, telling it to add its contribution (.6) to the
mailbox for document 0.

Execution now considers row 1 (Figure 14). First, the algorithm notes that only processors 0 and 1
contain row-1 postings for document; once again all other processors are deactivated. Second, DOC
ID's and document weights for row 1 are accessed. Third, the document weight is multiplied by the
query weight. Fourth, each processor determines the location of its document's mailbox. Finally,
each processor sends a message to increment the proper mailbox.
Figure 13: Handling the postings in row 0
Figure 14: Handling the postings in row 1
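
A serial simulation makes the communication step explicit. In this sketch, which is an illustration rather than Connection Machine code, the loop over active processors stands in for parallel execution, the in-place addition models send-with-add, and the helper names and toy postings are assumptions.

```python
def distribute(postings, p):
    """Place serial position n at row n // p of processor n % p."""
    rows = -(-len(postings) // p)
    memory = [[None] * rows for _ in range(p)]
    for n, posting in enumerate(postings):
        memory[n % p][n // p] = posting
    return memory


def score_with_send(groups, memory, p, num_docs):
    """Apply <weight, row, start, end> groups; return (mailboxes, message count)."""
    slots = -(-num_docs // p)
    mailboxes = [[0.0] * slots for _ in range(p)]      # mailboxes[processor][slot]
    messages = 0
    for weight, row, start, end in groups:
        for proc in range(start, end + 1):             # all other processors idle
            doc_id, doc_weight = memory[proc][row]
            dest, slot = doc_id % p, doc_id // p       # home of the document's mailbox
            mailboxes[dest][slot] += weight * doc_weight   # models send-with-add
            messages += 1
    return mailboxes, messages


# Hypothetical postings (document identifier, weight) at serial positions 0..7.
toy = [(1, .5), (2, .5), (3, .5), (0, .20), (1, .40), (2, .10), (0, .30), (3, .30)]
boxes, sent = score_with_send([(3, 0, 3, 3), (3, 1, 0, 1)], distribute(toy, 4), 4, 4)
```

Every scored posting costs one message, and the destination processor is generally not the one holding the posting.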

On the Connection Machine, incrementing the mailboxes is accomplished via an instruction called
send-with-add. This operation is the dominant cost of this algorithm, each such step taking 3
milliseconds (see footnote 5). The algorithm just described yields good performance, but experience
has shown that such inner-loop communication operations should, where feasible, be eliminated in
order to get the best possible performance out of the system. A new file structure, partitioned posting
files, will now be introduced to accomplish this purpose.

4.  Partitioned  Posting  Files 

The inner-loop send-with-add instruction is fundamentally unnecessary because document scoring
is completely decomposable: if one has N documents, the scoring operation may be accomplished by
N processors in unit time, with no communication. The original signature-based algorithm was
implemented in essentially that manner, with each virtual processor being assigned a single
document. The goal, then, is to find a representation which has the desirable properties of both the
inverted file algorithm (suitability for use with external storage, support for document-term weights)
and the signature algorithm (locality). This section will arrive at such a data structure by starting with
a posting file (essential for support of document-term weights), arranging the data so as to eliminate
query-time communication, and finally partitioning the data so as to arrive at a structure suitable for
use with external storage.

The send-with-add step may be eliminated if, rather than storing the postings in sequential order and
sending them to the correct processor at query-time, they are stored in the same node (see footnote 6)
as their mailbox. This file structure starts with the same set of tokens used to construct serial inverted
indexes. Each document is mapped to a node. For example, given a machine with two processors,
documents 0 and 2 might be assigned to node 0, and documents 1 and 3 might be assigned to node 1.
The tokens are then moved to an arbitrary processor within that node, and sorted by ascending term ID
within the node's shared memory (Figure 15).

5. Stanfill, Waltz, and Thau (1989) estimated the time to perform this operation at 1 millisecond. Subsequent
benchmarking revealed this estimate to have been low.

6.  As  defined  above,  a  node  is  a  group  of  processors  sharing  common  memory.  In  the  case  of  the  Connection  Machine, 
a  node  is  a  group  of  32  processors.  In  the  case  of  machines  with  no  shared  memory,  a  node  is  a  single  processor. 
In  the  case  of  a  machine  with  only  shared  memory,  the  entire  machine  may  be  considered  a  single  node,  although 
if  there  is  both  fast  access  to  local  memory  and  slower  access  to  shared  memory  it  may  be  best  to  ignore  the  shared 
memory  and  consider  each  processor  as  one  node. 
Figure 15: Distributing postings to their home node

A problem with the representation at this point is that the postings for a given term are not guaranteed
to be in the same row of data. This is called skewing. For example, the postings for term 8 are located
in row 7 of node 0 and row 5 of node 1. This causes great difficulty when databases resident on
secondary storage are constructed, since all rows of postings from the first occurrence of a term
through the last must be transferred to memory. Thus, in the example above, rows 5 through 7 would
need to be read. Unless countermeasures are taken, this skewing grows without bound as the database
size increases.

Skewing cannot be eliminated, but it can be kept within bounds by partitioning the posting file. Each
partition has a lower bound and an upper bound. Each partition contains only postings with word
identifiers between those two bounds, inclusively. Furthermore, the upper bound of each partition is
no greater than the lower bound of the next. For example, the database above might be divided into
five partitions with bounds (0 ... 2) (2 ... 4) (5 ... 6) (7 ... 8) and (9 ... 10). Partition boundaries are
stored on the host computer as part of the data map. Partitioning is, in essence, a mechanism for
forcing the periodic introduction of empty space in the posting file so as to retain a degree of
alignment.

To create a partitioned posting file, it is necessary to select a blocking factor $l$, which will be the
number of postings per partition, per node (in the present example $l = 2$).
An arbitrary processor in each node accesses the node's $l$'th posting (the posting in row $l$, counting
rows from 0); this is called a candidate boundary value. The host then uses a global minimum operation
to determine the smallest candidate boundary value; this becomes the upper bound ($U$) for a new
partition. All postings in rows 0 through $l-1$ with term ID $\le U$ are moved into the new partition.
Remembering that the postings are sorted in ascending order, and that the upper bound is the smallest
element in row $l$, it can be seen that the residue file contains only postings with ID $\ge U$, and the new
partition contains only postings with ID $\le U$. New partitions are created until no data remains.

For the example in Figure 15, row 2 of the posting file contains the term ID's 2 and 2. These are the
candidate boundary values. The boundary value is then also 2. All postings in rows 0 and 1 of the
unpartitioned file which are no greater than 2 are transferred to the partitioned file, and the bounds are
noted. This yields the new partition shown in Figure 16, and the residue of unconsumed postings
shown in Figure 17. This process is repeated until all postings have been transferred. The result is the
partitioned posting file shown in Figure 18. It is immediately obvious that the data structure in
Figure 18 does not fully utilize the available space: the file contains sufficient space for 18 postings,
of which 16 are actually used, for a utilization of 89%.
Figure 16: The first partition, with associated low and high bounds.
Figure 17: Unconsumed postings after the first partition is created.
Figure 18: Partitioned posting file
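
The partition-construction loop can be sketched as follows. This is a minimal illustration under stated assumptions: each node is represented only by its sorted list of term IDs, the names `build_partitions` and `utilization` are hypothetical, the lower bound is taken to be the smallest term ID actually placed in the partition, and the final partition simply takes whatever remains once no node has a row-$l$ posting left. The toy data is invented and is not the file of Figure 15.

```python
def build_partitions(nodes, l):
    """Partition per-node sorted term-ID lists.

    Returns a list of (low, high, rows), where rows[i] holds the (at most l)
    term IDs that node i contributes to the partition.
    """
    residue = [list(ids) for ids in nodes]
    partitions = []
    while any(residue):
        # Candidate boundary values: the posting in row l of each node, if any.
        candidates = [ids[l] for ids in residue if len(ids) > l]
        if candidates:
            upper = min(candidates)                  # global minimum
        else:                                        # final partition takes the rest
            upper = max(ids[-1] for ids in residue if ids)
        rows = []
        for i, ids in enumerate(residue):
            k = 0
            while k < min(l, len(ids)) and ids[k] <= upper:
                k += 1
            rows.append(ids[:k])                     # moved into the new partition
            residue[i] = ids[k:]                     # left in the residue file
        low = min(row[0] for row in rows if row)
        partitions.append((low, upper, rows))
    return partitions


def utilization(partitions, l):
    """Fraction of posting slots (l per node per partition) actually occupied."""
    used = sum(len(row) for _, _, rows in partitions for row in rows)
    return used / (len(partitions) * l * len(partitions[0][2]))


toy_nodes = [[0, 1, 2, 4, 6, 8], [1, 2, 2, 3, 5, 7, 9]]   # hypothetical term IDs
toy_partitions = build_partitions(toy_nodes, l=2)
```

On this toy input the bounds come out as (0, 2), (2, 5), (5, 9), (9, 9): each upper bound is no greater than the next lower bound, and 13 of 16 slots are occupied.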

In the serial and parallel implementations of inverted files, a data map precisely defines which
postings are associated with which terms. The term identifiers are thus unneeded and are dropped.
With partitioned posting files, the upper and lower bounds of the partition do not uniquely determine
the term associated with each posting. Thus the term identifiers cannot simply be dropped. Naively,
one might simply retain the term identifier in the posting file. Note, however, that each partition
contains a limited number of different terms, which allows the representation of term identifiers with
short integers. If, for example, no partition contains more than 32 distinct terms, then 5 bits suffice to
uniquely distinguish the terms within a partition. Such a compressed term identifier will be referred
to as a term tag. The document identifier can also be compressed. If there are $D$ documents and $N$
nodes, then there are $D/N$ documents per node, and a document may be uniquely represented by
$\lceil \log_2 (D/N) \rceil$ bits. This compressed document identifier will be called a document tag. Finally, one
may (generously) allocate 16 bits to the weight contained in the posting. The result is as shown in
Figure 19.
Figure 19: Compressed Partitioned Posting File
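
The size of a compressed posting follows directly from these three fields. The sketch below is illustrative; the function names are assumptions, and the example inputs (32 terms per partition, 200,000 documents, 256 nodes) are chosen for illustration rather than taken from the simulation results.

```python
def bits_for(count):
    """Bits needed to distinguish `count` values (at least 1)."""
    return max(1, (count - 1).bit_length())


def posting_bits(terms_per_partition, documents, nodes, weight_bits=16):
    """Size in bits of one compressed posting: term tag + document tag + weight."""
    term_tag = bits_for(terms_per_partition)
    docs_per_node = -(-documents // nodes)           # ceiling of D / N
    document_tag = bits_for(docs_per_node)
    return term_tag + document_tag + weight_bits


example_bits = posting_bits(32, 200_000, 256)        # 5 + 10 + 16 bits
```

Integer arithmetic is used instead of a floating-point logarithm so that exact powers of two are handled without rounding error.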

The  scalar  portion  of  this  data  structure  resides  in  the  host's  primary  memory;  the  vector  data  will 
generally  be  stored  on  parallel  secondary  storage,  and  moved  into  memory  one  partition  at  a  time.  A 
data  map  (e.g.  a  hash  table)  mapping  terms  to  term  ID's  and  the  range  of  partitions  containing  postings 
for  that  term  completes  the  data  structure. 

4.1  Storage  Requirements 

Three factors determine storage requirements: the fraction of empty space in the partitioned posting
file, the number of bits in the term-tag, and the number of bits in the document-tag. The number of bits
in  the  document  tag  was  derived  above.  The  number  of  bits  in  the  term  tag  and  the  fraction  of  empty 
space  in  the  posting  file  may  be  estimated  by  simulation.  The  simulation  procedure  is  as  follows: 

1)  A  random  non-stop-word  term  T,  is  selected. 

2)  Its  frequency  is  determined  by  equation  2. 

3)  Its  frequency  is  multiplied  by  the  size  of  that  database  to  yield  the  number  of  occurrences  in 
the  database. 

4)  These  occurrences  are  distributed  at  random  among  256  nodes  (corresponding  to  an 
8K-processor  machine). 

5) Steps 1-4 are repeated until sufficient data has been generated to yield statistically significant
results.

6) The resulting set of postings is partitioned. The partition boundaries are recorded, as is the
number of empty positions. A block-size of 128 postings per partition/node is employed.

7) The utilization (% of available posting slots actually occupied) is computed.

8) The number of bits in the document- and term-tags is computed. The size of the posting is
computed assuming 16 bits for weights.

9) The total number of postings in the database is determined by dividing the size of the database
by the average spacing of non-stop-words (1 per 17.3 bytes in this example).

10) The total space is obtained by multiplying the number of postings by the size of the postings,
divided by the utilization.

11) The overhead is obtained by dividing the posting-file size by the size of the database.

12) For machines larger than 256 nodes, it is assumed that doubling the quantity of data and the
number of processors keeps the occupancy and posting size constant, leaving storage
overhead unchanged.
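
Steps 9 through 11 reduce to a short calculation, sketched here. The function name is hypothetical, and the utilization of 0.82 used in the example is an illustrative assumption, not a figure reported by the simulation; the 35-bit posting size is the 1 Gigabyte value listed in Table 6.

```python
def storage_overhead(database_bytes, posting_bits, utilization, bytes_per_posting=17.3):
    """Ratio of posting-file size to database size."""
    postings = database_bytes / bytes_per_posting                     # step 9
    posting_file_bytes = postings * posting_bits / 8 / utilization    # step 10
    return posting_file_bytes / database_bytes                        # step 11


example_overhead = storage_overhead(1e9, posting_bits=35, utilization=0.82)
```

The database size cancels out of the ratio, so overhead depends only on posting size and utilization; this is why larger postings can be offset by better occupancy.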

Table  3  shows  the  results  of  this  simulation  for  databases  between  1  and  1024  Gigabytes,  assuming  a 
256  node  (8K  processor)  machine.  Equivalent  database  sizes  for  machines  having  512  (16KP),  1024 
(32KP)  and  2048  (64KP)  are  also  shown.  Storage  overhead  is  remarkably  constant  over  a  huge  range 
of  database  size,  with  improved  occupancy  being  exactly  balanced  by  growth  in  the  posting  size. 
.31    31%    30%    30%    29%    29%    28%    29%    29%

Table  3:  Storage  utilization 

As noted above, a 256 node (8KP) system has 1 Gigabyte of primary memory. Under that assumption,
a database of 3 Gigabytes would fit in primary memory and still leave 100 Megabytes of scratch space.
In addition, it has been noted that a Datavault can hold up to 40 Gigabytes of data, so that a single
Datavault could hold the postings for a database of 120 Gigabytes. A 1024 Gigabyte database would
require eight Datavaults.

5.  The  Scoring  Algorithm 

Given this data structure, queries can be executed without moving data between nodes. First, each
processor allocates and zeros out one mailbox for each document it has been assigned (see footnote 7)
(Figure 20). Second, the data map is consulted to determine which partitions need to be loaded into
primary memory. The tags corresponding to each term are then determined (Figure 21). The first
partition required for the query (partition 0) is then loaded into primary memory (Figure 22). Each
processor then loops through its postings, looking for postings with the correct term tag (2 at this
point). When one is found, the document tag is used to determine which of the mailboxes in the shared
memory of the node is to be accessed. In the present example, a processor in node 0 finds a matching
term tag in row 2. This posting has document tag 0, corresponding to mailbox 0. At this point, the query
weight (3) is multiplied by the document weight (.20), and the product used to increment the appropriate
mailbox (Figure 23). Note that, because duplicate occurrences of a term within a document are
forbidden, and only one term is processed at a time, there is no danger of two processors attempting to
access the same mailbox.

7. On the CM each processor is assigned 1/32 of the documents for its node.
Term        Weight    Partition    Term ID    Low Bound    Tag

Document    3         0            2          0            2
                      1            2          2            0
This        2         4            8          8            0

Figure 21: A list of all partitions needed for a query, plus data-map information
Figure 22: Postings from partition 0
Figure 23: Mailboxes after first partition has been scored

The process is repeated as, in turn, each partition is loaded, searched for postings with the appropriate
term tag, and the mailbox addressed by the document tag incremented. Continuing the example, partition
1 would be loaded and searched for occurrences of term-tag 0 (Figure 24). Two such occurrences are
found, both on row 0 of the partition. As a result, the scores for mailbox 1 in node 0, and mailbox 0 in
node 1 are incremented (Figure 25). The process is repeated until all partitions for all query-terms
have been processed.
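
The scoring loop over partitions can be sketched as below. This is an illustration under assumptions: nodes are simulated one after another, each posting is a (term tag, document tag, weight) triple, and the function name and the two-node toy file are invented for the example. The query plan follows the pattern of Figure 21, where the tag looked for in each partition is the term ID minus that partition's low bound.

```python
def score_partitioned(plan, partitions, mailboxes):
    """Score a query in place.

    plan:       (query weight, partition index, term tag) triples
    partitions: partitions[k][node] is a list of (term tag, document tag, weight)
    mailboxes:  mailboxes[node][document tag], zeroed by the caller
    """
    for query_weight, k, wanted_tag in plan:
        for node, postings in enumerate(partitions[k]):   # nodes work independently
            for term_tag, doc_tag, weight in postings:
                if term_tag == wanted_tag:
                    mailboxes[node][doc_tag] += query_weight * weight
    return mailboxes


# Hypothetical two-node file with two partitions.
toy_file = [
    [[(0, 0, .10), (1, 1, .30), (2, 0, .20)], [(0, 0, .20), (1, 1, .10)]],
    [[(0, 1, .40), (2, 0, .10)], [(0, 0, .50), (1, 1, .20)]],
]
toy_boxes = score_partitioned([(3, 0, 2), (3, 1, 0)], toy_file, [[0.0, 0.0], [0.0, 0.0]])
```

Each update touches only `mailboxes[node]` for the node that holds the posting, so no data crosses node boundaries at query time.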
Figure 24: Partition 1 is the second partition for document

6. The Ranking Algorithm

The output of the above algorithm is a set of mailboxes containing document scores. It is now
necessary to identify and rank the documents with the highest scores. The following algorithm, due to
Jim Hutchinson, is an efficient method for accomplishing this task. This algorithm uses the
processor-wise model rather than the node-wise one.

The mailboxes may be viewed as a 2-dimensional array, with rows corresponding to locations within
processor memory and columns corresponding to processors. To illustrate this aspect of the system, a
new example will be used, in which there are 4 processors and 16 documents, yielding 4 mailboxes
per processor (Figure 26).
Figure 26: A new sample problem, with 16 mailboxes in 4 processors

The first step is to determine the document identifier corresponding to each mailbox. As noted above,
the least-significant bits of a document ID select a node, and the most significant bits become the
document tag, which is used to select a mailbox within a processor's memory. This implies the
correspondence between mailboxes and document ID's shown in Figure 27. The document identifier
is then appended to the score, producing an array of surrogates (Figure 28). Next, the largest element
in each row is determined using a global maximum operation. As these maxima are computed, they are
stored in an array on the scalar front-end (the serial host) and the original data is zeroed out
(Figure 29). The maxima are then converted to a heap. The first element of the heap is then extracted.
In this case, the maximum element is 99.04, which indicates that document 4 had a score of 99. The
maximum for its row is then re-computed (Figure 30).
Figure 27: Document Identifiers
Figure 28: Surrogates (score.docid)
Figure 29: Maximal surrogates
Figure 30: Surrogates after first document is extracted

This process is repeated until sufficient documents have been found. The cost is one global maximum
and one heap-insert operation per row of mailboxes, plus one global maximum, one heap insert, and
one heap-delete operation per document to be ranked. The time required for these operations is
dominated by the time to compute the global maxima.
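
The row-wise ranking procedure can be simulated serially as follows. This sketch makes several assumptions: Python's `max` over a row stands in for the machine's global maximum operation, surrogates are (score, document ID) tuples rather than packed numbers, extracted entries are cleared with `None` instead of being zeroed, and the function name and toy grid are invented (the grid places a score of 99 at document 4, echoing the example in the text).

```python
import heapq


def rank_parallel(mailboxes, n_ret):
    """mailboxes[row][processor] holds scores; return top (score, doc_id) pairs."""
    p = len(mailboxes[0])
    # Surrogates pair each score with its document ID (row * p + processor).
    surrogates = [[(score, row * p + proc) for proc, score in enumerate(scores)]
                  for row, scores in enumerate(mailboxes)]

    def take_row_max(row):
        live = [s for s in surrogates[row] if s is not None]
        if not live:
            return None
        score, doc_id = max(live)                 # stands in for the global maximum
        surrogates[row][doc_id % p] = None        # clear the original entry
        return (-score, doc_id)                   # negated for the min-heap

    heap = []
    for row in range(len(surrogates)):            # one global maximum per row
        entry = take_row_max(row)
        if entry is not None:
            heap.append(entry)
    heapq.heapify(heap)

    ranked = []
    while heap and len(ranked) < n_ret:
        neg_score, doc_id = heapq.heappop(heap)
        ranked.append((-neg_score, doc_id))
        entry = take_row_max(doc_id // p)         # re-compute that row's maximum
        if entry is not None:
            heapq.heappush(heap, entry)
    return ranked


toy_grid = [[12, 5, 40, 7], [99, 3, 18, 60], [1, 77, 25, 9], [33, 8, 51, 2]]
toy_top3 = rank_parallel(toy_grid, 3)
```

The number of calls to `take_row_max` is the number of rows plus the number of documents extracted, matching the count of global maximum operations given above.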

6.1  Performance 

The  performance  of  the  ranking  algorithm  was  determined  by  the  following  procedure: 

1) The number of documents is determined by dividing the size of the database by the average
document size (5000 bytes).

2)  A  sufficient  number  of  mailboxes  are  allocated. 

3)  The  mailboxes  are  filled  with  random  numbers. 

4)  The  20  highest-ranking  scores  are  extracted  from  the  machine. 

For  databases  between  1  and  1024  Gigabytes  and  a  256  node  machine,  the  times  are  as  reported  in 
Table  5.  Again,  database  sizes  corresponding  to  larger  machines  are  shown. 


File Size, Gigabytes           1     2     4     8    16    32    64   128   256   512  1024
Equivalent on 512 nodes        2     4     8    16    32    64   128   256   512  1024  2048
Equivalent on 1024 nodes       4     8    16    32    64   128   256   512  1024  2048  4096
Equivalent on 2048 nodes       8    16    32    64   128   256   512  1024  2048  4096  8192
Ranking Time (msec)           27    37    53    80   132   278   410   781  1504  3000  6000

Table 5: Estimated times for ranking documents after scoring


The largest database involves 8192 Gigabytes of data, which would contain 1.6 billion documents.
The ranking time of 6 seconds on a 2048 node machine thus corresponds to 267 million mailboxes per
second. The Sun 4/330 delivered a performance of .91 million mailboxes per second, so the parallel
algorithm seems to deliver 293 times higher performance. These measurements were performed on a
machine having 256K bits per processor, rather than the maximal configuration of 1M bits per
processor. This provided insufficient mailbox space for the 512- and 1024-Gigabyte databases.
Times for those databases were, therefore, extrapolated.

7.  Secondary  Storage 

Before discussing the adaptation of partitioned posting files to secondary storage, it is useful to
understand the architecture of disk systems for parallel computers. Paterson (1988) has suggested
that, in order to support the high I/O rates required by parallel computers while maintaining a high
degree of reliability, it is sufficient to construct disk arrays from large numbers of inexpensive disks,
provided redundant information is stored to guard against media failures. Given $n$ disks, each with a
transfer rate of $T$, one may construct a disk array with a transfer rate of $nT$, which is used as if it were a
single disk (a technique called disk striping by Salem (1986)). The Connection Machine has available
a disk array called the Datavault, which has either 32 or 64 data drives, and supports a transfer rate of
up to 25 megabytes per second. In order to guard against loss of data due to drive failure, disk arrays
generally contain some extra disk units and keep a certain amount of redundant information, but this
is transparent as far as the user is concerned. To provide still higher transfer rates, up to 8 Datavaults
may be combined in parallel, providing transfer rates as high as 200 megabytes per second. The I/O
system still appears to the user as a single disk unit with a very high transfer rate. The number of
simultaneously active Datavaults will be referred to as the striping factor.

The read and write operations on a disk array will typically transfer information to/from all
processors at once, moving data to/from the same location in each processor (Figure 31). This is well
suited to the partitioned posting file representation described above: in a single I/O operation, it is
possible to transfer a contiguous set of partitions into CM memory. Thus, the query processing
strategy is to 1) determine which partitions are needed; 2) transfer only those partitions from
secondary to primary storage; and 3) use the scoring and ranking algorithms as described above.
Figure 31: Striped disk access

7.1  Performance 

The  Datavault  has  a  latency  of  200  milliseconds  and  a  transfer  rate  of  25  megabytes  per  second.  It  is 
possible  to  overlap  computation  with  latency,  but  not  with  transfers.  The  performance  of  the  system 
may  be  determined  as  follows: 

1) Determine the average size of a partition in bytes by multiplying the posting size in bits (q.v.
Table 3) by the number of slots per partition (32,768) and dividing by 8.

2) Multiply this by the average number of partitions per query-term (q.v. Table 4).

3) Determine the transfer time by dividing this figure by the transfer rate of 25 MB/second.

4) The latency is always 200 milliseconds.
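
The four steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used to produce Table 6; the function name is hypothetical, and it assumes that one megabyte is 10^6 bytes and that a posting size in bits is converted to bytes by dividing by 8.

```python
def striped_io_estimate(posting_bits, partitions_per_term,
                        slots_per_partition=32_768,
                        transfer_rate_mb_s=25.0,
                        latency_ms=200.0):
    """Estimate per-term I/O cost in striped mode.

    Returns (partition_mb, transfer_mb, transfer_ms, latency_ms).
    Assumes 1 MB = 10**6 bytes (an assumption, not stated in the text).
    """
    partition_bytes = posting_bits * slots_per_partition / 8      # step 1
    partition_mb = partition_bytes / 1e6
    transfer_mb = partition_mb * partitions_per_term              # step 2
    transfer_ms = transfer_mb / transfer_rate_mb_s * 1000.0       # step 3
    return partition_mb, transfer_mb, transfer_ms, latency_ms     # step 4


# Inputs for the 1 Gigabyte case: 35-bit postings, 1.1 partitions per term.
part_mb, xfer_mb, xfer_ms, lat_ms = striped_io_estimate(35, 1.1)
print(f"{part_mb:.3f} MB/partition, {xfer_mb:.3f} MB/term, "
      f"{xfer_ms:.1f} ms transfer, {lat_ms:.0f} ms latency")
```

With these inputs the transfer time comes to roughly 6 milliseconds, against a fixed 200 millisecond latency, which is the imbalance discussed below.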

Latency and transfer times are estimated for databases between 1 and 1024 Gigabytes (Table 6),
assuming a striping factor of 1. Corresponding times for striping factors of 2, 4, and 8 are also shown.
It is immediately apparent that, for databases of moderate size (1-128 Gigabytes), the latency dwarfs
the transfer time. A method for improving on this situation is clearly desirable.


File Size, Gigabytes             1     2     4     8    16    32    64   128   256   512  1024
Equivalent with stripe = 2       2     4     8    16    32    64   128   256   512  1024  2048
Equivalent with stripe = 4       4     8    16    32    64   128   256   512  1024  2048  4096
Equivalent with stripe = 8       8    16    32    64   128   256   512  1024  2048  4096  8192
Posting Size in Bits            35    35    35    36    36    36    37    37    37    38    39
Partition Size, MB            .143  .143  .143  .147  .147  .147  .152  .152  .152  .156  .160
Partitions/Term                1.1   1.2   1.4   1.9   2.7   4.3   7.5  13.6  26.0  50.5 100.0
Transfer/Term (MB)            .157  .172  .200  .280  .397  .632  1.14  2.07  3.95  7.89  16.0
Transfer Time/Term (msec)        6     7     8    11    16    25    46    83   158   316   640
Latency/Term (msec)            200   200   200   200   200   200   200   200   200   200   200

Table  6:  I/O  Behavior  for  striped  datavaults 


7.2  Independent  Disk  Access 

The Datavault contains a large number of disks which, if harnessed, ought to provide a large number
of I/Os per second, effectively reducing the latency. This is currently supported as an experimental
feature of the Datavault. In this independent disk access mode, the Datavault must be thought of as a
2-dimensional array of disk blocks, where columns correspond to the individual disks and rows
correspond to disk-addresses. In one primitive independent read operation, it is possible to read one
block from each column (drive). The read operation will transfer information to all processors at
once, spreading a disk-block evenly across all processors (Figure 32). Ultimately, some form of error
correction/redundancy will be needed, such as one of Patterson's RAID schemes (1988).

Figure 32: Independent Disk Mode

The partitioned posting file is mapped to this disk array in row-major order, so that row i, column j
corresponds to partition j + i*32 (assuming there are 32 disks).

The algorithm for accessing secondary storage is as follows:

1)  When  a  query  is  received,  the  partitions  needed  to  service  the  query  are  located. 

2)  These  partitions  are  mapped  to  rows  and  columns  in  the  disk  array. 

3)  This  set  of  row-column  requests  is  given  to  the  disk  system,  which  parcels  them  out  among 
the  32  disks,  performing  as  many  primitive  independent-disk  reads  as  are  necessary  to 
retrieve  the  needed  information. 

4) As each set of 32 partitions is read into memory, they are scored in accordance with the
algorithm  in  Section  5. 

The workloads assigned to the 32 disks will not be uniform. For example, if 32 partitions are to be
retrieved from 32 disks, it is likely that several disks will have none of the needed partitions, while one
disk might have 3 requests. The maximum number of partitions mapping to a single disk will be called
the depth of the query; it is the depth that governs the number of primitive independent disk reads,
hence the time required to service the query (see footnote 9).
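
The row-major mapping and the notion of query depth can be illustrated with a short Python sketch. The function names are hypothetical, and 32 data drives are assumed, as in the text.

```python
from collections import Counter

NUM_DISKS = 32  # data drives assumed, as in the text


def partition_to_block(partition, num_disks=NUM_DISKS):
    """Row-major mapping: partition j + i*num_disks is stored at row i, column j."""
    row, column = divmod(partition, num_disks)
    return row, column


def query_depth(partitions, num_disks=NUM_DISKS):
    """Largest number of requested partitions that fall on any single disk."""
    per_disk = Counter(partition_to_block(p, num_disks)[1] for p in partitions)
    return max(per_disk.values(), default=0)


# Partitions 5, 37 and 69 all map to column (disk) 5, so three
# primitive independent reads are needed to fetch this request set.
requests = [0, 5, 37, 69, 70]
print([partition_to_block(p) for p in requests])
print(query_depth(requests))
```

In this example disk 5 receives three requests while most disks receive none, so the depth is 3 even though only five partitions are needed.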

7.3  Performance  with  independent  disk 

It takes 100 milliseconds to initiate independent disk access mode. Once this has been done, each
primitive  independent  disk  read  has  an  additional  latency  of  125  milliseconds,  after  which  data  will 
flow  at  25  megabytes/second.  After  data  has  been  transferred,  the  CM  requires  18  milliseconds  of 
compute  time  per  megabyte  of  data  (assuming  a  256  node  system)  to  rearrange  the  data.  The  latency 
period  may  be  overlapped  with  the  rearrangement  period,  and  with  arbitrary  computations.  The 
transfer  period  may  not  be  overlapped  with  anything.  The  following  computation  establishes  the 
basic  performance  characteristics  of  independent  disk  access  for  partitioned  posting  files: 

a. Maximum Posting Size: 40 bits (5 bytes)

b. Partition capacity (256 node system): 32,768 postings

c. Partition Size (a * b): 163,840 bytes

d. Transfer Size (32 * c): 5.24 megabytes

e. Transfer Rate: 25 megabytes/second

f. Transfer Time (d / e): 210 msec

g. Rearrangement Time Factor: 18 msec / megabyte

h. Rearrangement Time (g * d): 94 msec

i. Time to score 1 partition: .695 msec

j. Time to score 32 partitions (32 * i): 22 msec

k. Rearrangement + scoring time (h + j): 116 msec

l. Latency: 125 msec

m. Per-depth time (f + max(l, k)): 335 msec
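
The arithmetic in items a through m can be checked with a short Python sketch. The function names are hypothetical, and the sketch assumes 1 megabyte = 10^6 bytes. The second function assumes that the I/O-plus-scoring time of a query is simply its depth multiplied by the per-depth time, ignoring the 100 millisecond cost of entering independent disk mode; this appears consistent with the figures in Table 7 but is not stated explicitly in the text.

```python
def per_depth_time_ms(posting_bytes=5, postings_per_partition=32_768,
                      disks=32, transfer_rate_mb_s=25.0,
                      rearrange_ms_per_mb=18.0,
                      score_ms_per_partition=0.695,
                      latency_ms=125.0):
    """Time for one primitive independent read of `disks` partitions (item m)."""
    partition_bytes = posting_bytes * postings_per_partition        # c
    transfer_mb = disks * partition_bytes / 1e6                      # d
    transfer_ms = transfer_mb / transfer_rate_mb_s * 1000.0          # f
    rearrange_ms = rearrange_ms_per_mb * transfer_mb                 # h
    scoring_ms = disks * score_ms_per_partition                      # j
    # Latency overlaps with rearrangement and scoring; transfer does not.
    overlapped_ms = max(latency_ms, rearrange_ms + scoring_ms)       # max(l, k)
    return transfer_ms + overlapped_ms                               # m


def io_plus_score_time_s(depth, per_depth_ms):
    """Assumed model: total time = query depth * per-depth time."""
    return depth * per_depth_ms / 1000.0


step_ms = per_depth_time_ms()
print(f"per-depth time: {step_ms:.0f} msec")
print(f"depth 6.0 query: {io_plus_score_time_s(6.0, step_ms):.1f} sec")
```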

Accessing a batch of 32 partitions thus involves 210 msec of transfer time plus 125 msec of latency,
which can be exactly overlapped with rearrangement and scoring, for a total per-phase time of 335
milliseconds. It remains only to determine the depth of the average query. This was done by the
following procedure:

1) A query is created by selecting either 10 or 30 terms at random, according to the query-term
distribution described above.

2) The number of partitions required to service the request is determined by the procedure
previously outlined.

3) The disk of the first partition for each query term is determined at random.

4) If a given query-term requires multiple partitions, those partitions are assigned to subsequent
disks in round-robin order, as described above.

5) The maximum query-depth (greatest number of partitions mapping to a given disk) is
determined.

6) Steps 1-5 are repeated until enough queries have been generated to yield statistically
significant results.

Footnote 9: A consequence of mapping partitions to disks in round-robin order is that, in cases where a
term involves multiple partitions, those partitions are guaranteed to map to different disks, a property
which reduces load imbalance.
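
The sampling procedure can be sketched as a small Monte Carlo simulation in Python. This is an illustration only: the query-term distribution and the partitions-per-term procedure are defined elsewhere in the paper, so the sketch takes them as a caller-supplied function (`sample_partition_counts`, a hypothetical name), and the stand-in sampler at the end is an assumption made purely so that the example runs.

```python
import random


def simulate_mean_depth(sample_partition_counts, num_queries=10_000,
                        num_disks=32, seed=0):
    """Average query depth over randomly generated queries.

    sample_partition_counts(rng) must return, for one query, a list giving
    the number of partitions needed by each query term.
    """
    rng = random.Random(seed)
    total_depth = 0
    for _ in range(num_queries):
        load = [0] * num_disks                      # requests per disk
        for count in sample_partition_counts(rng):
            first = rng.randrange(num_disks)        # random disk for first partition
            for k in range(count):                  # round-robin for the rest
                load[(first + k) % num_disks] += 1
        total_depth += max(load)                    # depth of this query
    return total_depth / num_queries


def ten_single_partition_terms(rng):
    """Hypothetical stand-in sampler: 10 terms, one partition each."""
    return [1] * 10


print(simulate_mean_depth(ten_single_partition_terms))
```

Substituting a sampler that follows the actual query-term distribution and the partition counts of Table 4 would be needed to approximate the depths reported in Table 7.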

The  results  are  shown  in  Table  7.  In  this  case,  results  do  not  necessarily  scale  uniformly  with  system 
size;  it  is  not  clear  that  an  8  GB  database  on  a  system  with  8  Datavaults  and  2048  nodes  has  the  same 
performance  characteristics  as  a  1  GB  database  with  1  Datavault  and  256  nodes. 


File Size, Gigabytes             1     2     4     8    16    32    64   128   256   512  1024

Query Depth, 10 terms          1.9   1.9   2.0   2.2   2.5   3.1   4.0   6.0   9.6  16.4  31.6
I/O + Score Time (sec)          .6    .6    .7    .7    .8   1.0   1.3   2.0   3.2   5.5  10.6

Query Depth, 30 terms          3.4   3.4   3.7   4.1   5.1   6.7   9.4  15.0  26.3  48.1  91.7
I/O + Score Time (sec)         1.1   1.1   1.2   1.4   1.7   2.2   3.1   5.0   8.8  16.1  30.7

Table  7:  I/O  plus  score  times  for  independent  disk  access  system 

8.  Retrieval  System  Architectures 

It  is  now  time  to  combine  the  various  results  from  other  sections  of  this  paper,  and  propose  some 
architectures.  Three  basic  architectures  will  be  presented: 

1) A main-memory architecture (no secondary storage access at run-time).

2) A disk-resident database using independent disk access.

3) A disk-resident database using striped access.

As  will  be  seen,  these  systems  are  suitable  for  use  on  progressively  larger  databases. 

8.1  The  main-memory  architecture 

The main-memory database is limited by the capacity of primary memory. Depending on the size of
machine employed, memory could be between .5 and 8 Gigabytes. As noted above, the partitioned
posting file is 30% the size of the text file. Thus, on the largest available machine databases up to 24
Gigabytes may be stored, leaving 10% of memory free for scratch space and mailboxes. For this
architecture, the number of processors, hence the available compute-power, is proportional to the
size of the database, and response-times are constant. Because no I/O is needed, response is
extremely fast. This architecture is attractive for a central server shared by a large number of
subscribers, accessing databases of modest size but great value. As the price of semiconductor
memory continues to plummet, this architecture will become increasingly attractive. This
architecture is summarized in Table 8.


File Size, Gigabytes            1.5      3      6     12     24
Posting File Size (GB)          .45     .9    1.8    3.6    7.2
Machine Size (nodes)            128    256    512   1024   2048
Machine Size (procs)             4K     8K    16K    32K    64K
Main Memory (GB)                 .5      1      2      4      8
Scratch Space (GB)              .05     .1     .2     .4     .8
Documents in DB (million)        .3     .6    1.2    2.4    4.8
Memory for Mboxes (GB)         .001   .002   .004   .008   .016
Time/query-term (sec)          .001   .001   .001   .001   .001
Time/10 terms (sec)            .010   .010   .010   .010   .010
Time/30 terms (sec)            .030   .030   .030   .030   .030
Rank Time (sec)                .045   .045   .045   .045   .045
Total Time/10 terms (sec)      .055   .055   .055   .055   .055
Total Time/30 terms (sec)      .075   .075   .075   .075   .075

Table  8:  Performance  characteristics  of  main-memory  architecture 
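
The sizing rows of Table 8 follow from two figures given in the text: the posting file is 30% of the size of the text, and 10% of memory is held back. A minimal Python sketch of that calculation is given below; the function name is hypothetical, and treating the whole 10% reserve as "scratch space" is a simplifying assumption.

```python
def main_memory_capacity(memory_gb, posting_ratio=0.30, free_fraction=0.10):
    """Return (text_gb, posting_gb, scratch_gb) for a given memory size."""
    scratch_gb = memory_gb * free_fraction     # reserved for scratch space and mailboxes
    posting_gb = memory_gb - scratch_gb        # memory left for the posting file
    text_gb = posting_gb / posting_ratio       # text that this posting file indexes
    return text_gb, posting_gb, scratch_gb


for memory_gb in (0.5, 1, 2, 4, 8):
    text_gb, posting_gb, scratch_gb = main_memory_capacity(memory_gb)
    print(f"{memory_gb:>4} GB memory: {posting_gb:.2f} GB posting file, "
          f"{text_gb:.1f} GB text, {scratch_gb:.2f} GB reserved")
```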

Table 2 predicts a time of .449 seconds for a 10 term query on a 1 Gigabyte database, using a Sun
4/330, and a time of .909 seconds for a 30 term query on the same database. Extrapolating to a 24
Gigabyte database, one gets times of 10.8 seconds and 21.8 seconds. Dividing by the predicted
performance of the main-memory architecture, one sees a performance gain of 195 in the first case and
316 in the second.

8.2  The  Independent  Disk  Architecture 

For databases which are of moderate size, and either too large to fit into the memory of any machine or
not accessed sufficiently often to justify the expense of buying a large machine, a system using
independent disk access is appropriate. Databases between 4 and 128 Gigabytes can be comfortably
handled by a system with a single 40 Gigabyte Datavault and a single 8K-processor (256 node)
Connection Machine. The I/O and scoring times for this system have already been deduced; it
remains only to add in the ranking times. The characteristics of this system are summarized in
Table 9.

File Size, Gigabytes               4      8     16     32     64    128
Posting File Size, GB            1.2    2.4    4.8    9.6   19.2   38.4
Machine Size (nodes)             256    256    256    256    256    256
Machine Size (procs)              8K     8K     8K     8K     8K     8K
Datavaults                         1      1      1      1      1      1

Query Depth, 10 terms            2.0    2.2    2.5    3.1    4.0    6.0
Scoring Time (sec)                .7     .7     .8    1.0    1.3    2.0

Query Depth, 30 terms            3.7    4.1    5.1    6.7    9.4   15.0
Scoring Time (sec)               1.2    1.4    1.7    2.2    3.1    5.0

Ranking Time (sec)              .053   .080   .132   .228   .410   .781
Total Time, 10 terms (sec)        .8     .8     .9    1.2    1.7    2.8
Total Time, 30 terms (sec)       1.3    1.5    1.8    2.4    3.5    5.8

Table  9:  Performance  characteristics  of  independent  disk  architecture 


8.3 The Striped Disk Architecture

Once the size of the database exceeds 128 Gigabytes, transfers become large enough to justify the use
of striped disk mode, which optimizes for transfer rate at the expense of latency. As the database
grows, more Datavaults and/or more processors are added. It is assumed that the striping factor
equals the number of 8KP CM segments in the system, so that a 64KP system would support a striping
factor of 8. Additional Datavaults are added as necessary to provide additional storage capacity. For
a given size database, there is generally a tradeoff between the number of processors (hence the
possible striping factor) and the system performance. Times for 10-term queries are shown in
Table 10; times for 30-term queries will be correspondingly longer. In all cases, the time to score a
query-term is much smaller than the latency, so the time to process a query-term is simply equal to the
latency time plus the transfer time.

File Size, Gigabytes             128    256    512   1024    256    512   1024   2048
Posting File Size, GB             38     76    152    304     76    152    304    608
Machine Size (nodes)             256    512   1024   2048    256    512   1024   2048
Machine Size (procs)              8K    16K    32K    64K     8K    16K    32K    64K
Datavaults                         1      2      4      8      2      4      8     16
Transfer Time (sec)             .083   .083   .083   .083   .158   .158   .158   .158
Latency (sec)                   .200   .200   .200   .200   .200   .200   .200   .200
Scoring Time/Term (sec)         .009   .009   .009   .009   .018   .018   .018   .018
Query Time/Term (sec)           .283   .283   .283   .283   .358   .358   .358   .358
Query Time/10 Terms (sec)        2.8    2.8    2.8    2.8    3.6    3.6    3.6    3.6
Ranking Time (sec)                .8     .8     .8     .8    1.5    1.5    1.5    1.5
Total Time/10 Terms (sec)        3.6    3.6    3.6    3.6    5.1    5.1    5.1    5.1

File Size, Gigabytes             512   1024   2048   4096   1024   2048   4096   8192
Posting File Size, GB            152    304    608   1216    304    608   1216   2432
Machine Size (nodes)             256    512   1024   2048    256    512   1024   2048
Machine Size (procs)              8K    16K    32K    64K     8K    16K    32K    64K
Datavaults                         4      8     16     32      8     16     32     64
Transfer Time (sec)             .316   .316   .316   .316   .640   .640   .640   .640
Latency (sec)                   .200   .200   .200   .200   .200   .200   .200   .200
Scoring Time/Term (sec)         .035   .035   .035   .035   .068   .068   .068   .068
Query Time/Term (sec)           .516   .516   .516   .516   .840   .840   .840   .840
Query Time/10 Terms (sec)        5.2    5.2    5.2    5.2    8.4    8.4    8.4    8.4
Ranking Time (sec)               3.0    3.0    3.0    3.0    6.0    6.0    6.0    6.0
Total Time/10 Terms (sec)        8.2    8.2    8.2    8.2   12.4   12.4   12.4   12.4

Table  10:  Performance  characteristics  of  striped  disk  architecture 
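
The way the per-term and total times in Table 10 are composed can be sketched in Python. This is a minimal illustration under the assumption stated in the text that each query term costs one latency period plus one transfer period, with the ranking time added once per query; the function name is hypothetical.

```python
def striped_total_time_s(transfer_s, ranking_s, terms=10, latency_s=0.200):
    """Total query time in striped mode: terms * (latency + transfer) + ranking."""
    per_term_s = latency_s + transfer_s
    return terms * per_term_s + ranking_s


# Transfer and ranking times taken from three column groups of Table 10.
for transfer_s, ranking_s in ((0.083, 0.8), (0.158, 1.5), (0.316, 3.0)):
    total_s = striped_total_time_s(transfer_s, ranking_s)
    print(f"transfer {transfer_s:.3f} s, ranking {ranking_s:.1f} s "
          f"-> total {total_s:.1f} s")
```

For these three inputs the sketch gives totals of about 3.6, 5.1 and 8.2 seconds for a 10-term query.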


9.  Discussion 

Figure 33 summarizes the various architectural options, in terms of response time and database size.
First, it should be noted that one of the three architectures is appropriate to almost any database that
can be imagined, and that reasonable response times are possible for databases of essentially
unlimited size. In an architectural sense, document ranking can thus be considered a solved problem.
Doubtless there is room for improvement. Indeed, the above architectures go far beyond the actual
demands of the day; text databases of more than a few hundred gigabytes are currently unknown.

Figure 33: Summary of Architectures

The  algorithms  described  above  are  modest  in  the  demands  they  make  on  processor  architecture. 
Very  little  communication  is  needed  at  run-time.  A  serial  host  must  broadcast  a  small  amount  of 
information  (term  tags,  query-term  weights)  to  the  processing  elements  every  time  a  new  partition  is 
processed.  The  ranking  algorithm  primarily  consists  of  global-maximum  operations.  The  vast 
majority  of  interprocessor  communication  takes  place  when  the  partitioned  posting  file  is 
constructed;  further  work  is  needed  to  determine  how  much  communication  is  actually  required.  The 
processing  elements  themselves  must  support  indirect  addressing;  this  is  not  a  problem  with  most 
contemporary  parallel  computers. 

There is, however, considerable need for additional work. There remain some straightforward
algorithmic issues. For example, the problem of building and updating these databases has not yet
been fully addressed; this is currently being studied. There also remain some substantial
technological issues relating to storage technology. First, the main memory database architecture,
with its near-instant response times, is highly desirable, but is restricted by current memory
technology, which makes the assembly of a Connection Machine system having more than 8
Gigabytes of semiconductor memory impossible at this time. There is no doubt that, with the
continuing rapid evolution of memory technology, this limitation will soon be overcome. Within this
decade, databases of up to 128 Gigabytes should fit into primary memory. There is also need for more
work on disk technology and disk array technology. At this writing, no Connection Machine has been
configured with more than 3 Datavaults. The largest configuration proposed above requires 64
Datavaults; the logistics of managing such a huge amount of storage remain unsolved. In addition, the
cost of secondary storage remains, for the largest databases, a considerable problem. For an
organization that really needed access to a database of 8 Terabytes, the cost of a 64K-processor
Connection Machine should not, in practice, be a major obstacle, but the cost of storage media would
likely prove prohibitive. However, as optical and other low-cost, high-volume media mature, it is
likely that this problem will be overcome. Finally, for databases of intermediate size, the latency and
bandwidth characteristics of the Datavault, even with independent disk access, remain a limiting
factor; improvements in these areas might prove quite helpful.

The above problems are either algorithmic or technological, and will certainly be solved. The
primary problems at this point are those relating to the science of information retrieval, in particular
automatic indexing. Parallel computing technology has evolved to the point that searching a 128
Gigabyte database does not present a significant technological problem. Predicting the quality of
such a search, however, remains a considerable scientific challenge. The ability of serial machines to
collect and search text has already strained the manual methods of library science. With parallel
computing, the quantities of data that can be collected and searched are so staggering that there is no
hope whatsoever of manually indexing it, leaving no alternative to full-text databases and automatic
information retrieval. However, the majority of research in information retrieval has been conducted
on tiny (< 100 Megabyte) databases of abstracts from scientific publications. The technology of text
databases has thus outstripped the science of text databases by five orders of magnitude. This is a
situation which, given suitable funding of basic research in IR, might be remedied, but in the absence
of adequate support the application of the technology presented in this paper may be limited by a lack
of knowledge of how to apply it.